IEOR 8100-001: Learning and Optimization for Sequential Decision Making 02/03/16 Lecture 5: Thompson Sampling (part II): Regret bound proofs

Author

  • Shipra Agrawal

Abstract

We describe the main technical difficulties in the regret proof for the Thompson Sampling (TS) algorithm as compared to the UCB algorithm. In the UCB algorithm, the suboptimal arm 2 is played at time t only if its UCB value is higher, i.e. if UCB_{2,t−1} > UCB_{1,t−1}. Once arm 2 has been pulled Ω(log(T)/∆²) times, with high probability this will not happen: after n_{2,t} ≥ Ω(log(T)/∆²) pulls, concentration bounds imply that UCB_{2,t} is close to the true mean μ_2, and hence, with high probability, below μ_1 ≤ UCB_{1,t}, so the suboptimal arm is no longer selected.
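The comparison described above can be made concrete with a minimal UCB1 sketch for a two-armed Bernoulli bandit. This is not code from the lecture; the arm means, horizon, and confidence radius sqrt(2 ln t / n_i) are standard illustrative choices. Once the suboptimal arm has been sampled enough times, its upper confidence bound concentrates near its true mean and the optimal arm dominates the comparison, so it is pulled far more often.

```python
import math
import random

def ucb1(means, T, seed=0):
    """Minimal UCB1 sketch for a Bernoulli bandit (illustrative only).

    means: true success probabilities of the arms (unknown to the algorithm).
    T: horizon. Returns the pull counts n_{i,T} of each arm.
    """
    random.seed(seed)
    K = len(means)
    counts = [0] * K    # n_{i,t}: number of pulls of arm i so far
    sums = [0.0] * K    # cumulative reward of arm i

    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1  # pull each arm once to initialize
        else:
            # UCB_{i,t-1} = empirical mean + sqrt(2 ln t / n_i)
            ucbs = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
                    for i in range(K)]
            arm = max(range(K), key=lambda i: ucbs[i])
        reward = 1.0 if random.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

# With gap ∆ = 0.3, the suboptimal arm's pull count stays near O(log(T)/∆²)
# while the optimal arm absorbs almost all of the remaining pulls.
counts = ucb1([0.8, 0.5], T=10000)
```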




Publication date: 2016